Coordinated Checkpointing-Rollback Error Recovery for Distributed Shared Memory Multicomputers

Authors

G. Janakiraman

Yuval Tamir

Abstract

Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sensitive to the frequency of interprocess communication in the applications. For message-passing systems, low overhead error recovery based on coordinated checkpointing allows the frequency of checkpointing to be determined only by the reliability requirements of the application. Efficient adaptation of this approach to DSM multicomputers is complicated by the absence of explicit messages in DSM systems, the presence of a shared and partially replicated address space, and the presence of a distributed coherency directory. We present solutions to these issues, and propose an error recovery scheme based on coordinated checkpointing and rollback for DSM multicomputers. Our performance evaluation based on trace-driven simulations indicates that this scheme incurs less checkpoint traffic than recovery schemes previously proposed for DSM systems.
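As a rough illustration of the general idea behind coordinated checkpointing and rollback, the sketch below pauses every node, saves all node states together so that they form one globally consistent checkpoint, and later restores all nodes to that checkpoint. It is not the scheme proposed in the paper: the `Node` class, its fields, and the `coordinated_checkpoint` and `rollback` functions are hypothetical names chosen for this example, and DSM-specific issues such as replicated pages and the coherency directory are not modeled.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical node holding a private slice of application state."""
    name: str
    state: dict = field(default_factory=dict)
    paused: bool = False
    saved: Optional[dict] = None  # last checkpointed copy of `state`

    def save(self) -> None:
        # Deep copy so later updates to `state` do not alter the checkpoint.
        self.saved = copy.deepcopy(self.state)

    def restore(self) -> None:
        if self.saved is None:
            raise RuntimeError(f"node {self.name} has no checkpoint")
        self.state = copy.deepcopy(self.saved)


def coordinated_checkpoint(nodes: List[Node]) -> None:
    """Take one checkpoint across all nodes (illustrative only)."""
    # Phase 1: quiesce every node so no update is in flight.
    for node in nodes:
        node.paused = True
    # Phase 2: every node saves its state while the system is quiescent.
    for node in nodes:
        node.save()
    # Phase 3: resume normal operation.
    for node in nodes:
        node.paused = False


def rollback(nodes: List[Node]) -> None:
    """Restore every node to the most recent coordinated checkpoint."""
    for node in nodes:
        node.restore()


if __name__ == "__main__":
    a, b = Node("a", {"x": 1}), Node("b", {"y": 2})
    coordinated_checkpoint([a, b])
    a.state["x"] = 10  # work done after the checkpoint
    b.state["y"] = 20
    rollback([a, b])   # an error is detected; all nodes roll back together
    print(a.state, b.state)
```

Because all nodes save and restore together, the checkpoint interval in this sketch is a free parameter rather than something driven by how often nodes communicate, which is the property the abstract attributes to coordinated checkpointing.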

Similar resources

Rollback Recovery Scheme for Distributed Shared Memory Clusters

In this paper, a unified lightweight error recovery scheme based on coordinated checkpointing and rollback for distributed shared memory clusters is proposed. The new scheme maintains multiple globally consistent checkpoints of the state of a distributed shared memory cluster and recovers to a pre-fault checkpoint of the system. It also describes and evaluates the coordinated checkpointing. Th...

Survey of Backward Error Recovery Techniques for Multicomputers Based on Checkpointing and Rollback

For implementing fault tolerance in multicomputer systems, backward error recovery, based on checkpointing and rollback, is often used. During failure-free operation, the process states are regularly saved, and after a fault is detected, the system is rolled back to a previously saved state. We can distinguish four classes of techniques: semi-automatic techniques, message logging, coordinated ch...

Ensuring Correct Rollback Recovery in Distributed Shared Memory Systems

Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attractive platform for executing parallel scientific applications. Checkpointing and rollback techniques can be used in such a system to allow the computation to progress in spite of the temporary failure of one or more processing nodes. This paper presents the design of an independent checkpointing met...

Fault-Tolerance Using Cache-Coherent Distributed Shared Memory Systems

In this paper, we describe new protocols augmenting traditional cache coherency mechanisms to implement fault-tolerance based on Recovery Blocks and checkpointing. Concurrent processes compound rollback recovery since the rollback can potentially lead to a "domino-effect" whereby the process is rolled back to the beginning. Several approaches have been proposed to limit the domino effect. One s...

Logging and Recovery in Adaptive Software Distributed Shared Memory Systems

Software distributed shared memory (DSM) improves the programmability of message-passing machines and workstation clusters by providing a shared memory abstraction (i.e., a coherent global address space) to programmers. As in any distributed system, however, the probability of software DSM failures increases as the system size grows. This paper presents a new, efficient logging protocol for adapti...

Publication date: 1994